Corpus and sentiment analysis
Abstract
Information extraction/retrieval has been of interest to researchers since the early 1960s. A series of conferences and competitions held by DARPA/NIST since the late 1980s has resulted in the analysis of news reports and government reports in English and other languages, notably Chinese and Arabic. A number of methods have been developed for analysing 'free' natural language texts. Furthermore, a number of systems for understanding messages have been developed, focusing on named entity extraction and on templates for dealing with certain kinds of news. The templates were handcrafted, and a lot of ad hoc knowledge went into the creation of such systems. Seven of these systems have been reviewed. Although IE systems built for different tasks often differ from each other, their core elements are shared by nearly every extraction system. Some of these core elements, such as the parser and the part-of-speech (POS) tagger, are tuned for optimal performance in a specific domain or on text with pre-defined structures. The extensive use of gazetteers and manually crafted grammar rules further limits the portability of existing IE systems across languages and domains. The goal of this thesis is to develop an algorithm that can extract information unambiguously from free texts, in our case financial news texts, and from arbitrary domains. We believe that corpus linguistics and statistical techniques would be more appropriate and efficient for this task than approaches that rely on machine learning, POS taggers, parsers, and so on, which are tuned to work for a predefined domain. Based on this belief, a framework was developed that uses corpus linguistics and statistical techniques to extract information as unambiguously as possible from arbitrary domains.
A contrastive evaluation has been carried out not only in the domains of financial texts and movie reviews, but also with multilingual texts (Chinese and English). The results are encouraging. Our preliminary evaluation was based on the correlation between a time series of positive (negative) sentiment word (phrase) counts and a time series of indices produced by stock exchanges (Financial Times Stock Exchange, Dow Jones Industrial Average, Nasdaq, S&P 500, Hang Seng Index, Shanghai Index, and Shenzhen Index). It showed that when the positive (negative) sentiment series correlates with the stock exchange index, the negative (positive) series shows a smaller degree of correlation and, in many cases, a degree of anti-correlation. Any interpretation of our results requires a careful, econometrically well-grounded analysis of the financial time series; this is beyond the scope of this work.
Acknowledgments
I would like to express my sincere thanks to Professor Khurshid Ahmad for his encouragement and supervision throughout my research. My thanks also go to the EU co-funded project Generic Information based Decision Assistant (GIDA IST-200031123) and the FinGrid Project (ESRC Project Number RES-149-25-0028) for financial backing. My thanks are further extended to my colleagues Lee Gillam, Tugba Taskaya, Paulo Oliveria, Saif Ahmad and Hayssam Traboulsi, who have been working on the GIDA project. Thanks also to Dr. Andrew Hippisley, who provided linguistic insights when I began my Ph.D. Most importantly, I thank my parents, my sisters, and, in particular, my friend Pensiri Manomaisupat for their support in the preparation of this work. Without their patience, understanding and support, it would not have been possible for me to finish this thesis. To them I will always be grateful, and to them I dedicate this thesis.
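The evaluation described in the abstract correlates a time series of sentiment word counts with a time series of stock index values. The following is a minimal sketch of that idea, not the thesis's implementation: the lexicons, the news snippets, the index values, and the function names `count_terms` and `pearson` are all invented for illustration, and the Pearson coefficient is assumed as the correlation measure because the abstract does not name one.

```python
from math import sqrt

# Hypothetical mini-lexicons (assumption); the thesis derives its terms from corpora.
POSITIVE = {"rise", "gain", "profit", "growth"}
NEGATIVE = {"fall", "loss", "decline", "risk"}


def count_terms(text, lexicon):
    """Count whitespace-separated tokens of `text` found in `lexicon`."""
    tokens = (token.strip(".,;:!?") for token in text.lower().split())
    return sum(1 for token in tokens if token in lexicon)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Toy data, invented for illustration only: one snippet and one index close per day.
daily_news = [
    "Shares rise on profit growth",
    "Markets gain despite risk",
    "Stocks fall after loss warning",
    "Decline continues as risk grows",
]
index_close = [100.0, 101.5, 99.0, 97.5]

positive_series = [count_terms(text, POSITIVE) for text in daily_news]
negative_series = [count_terms(text, NEGATIVE) for text in daily_news]

r_positive = pearson(positive_series, index_close)
r_negative = pearson(negative_series, index_close)

print("positive counts vs index:", round(r_positive, 3))
print("negative counts vs index:", round(r_negative, 3))
```

On this toy data the positive series moves with the index and the negative series moves against it, which mirrors the pattern the abstract reports. As the abstract itself cautions, such coefficients say nothing on their own without a proper econometric analysis of the underlying series.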
Similar resources
FB-NEWS15: A Topic-Annotated Facebook Corpus for Emotion Detection and Sentiment Analysis
English. In this paper we present the FBNEWS15 corpus, a new Italian resource for sentiment analysis and emotion detection. The corpus has been built by crawling the Facebook pages of the most important newspapers in Italy and it has been organized into topics using LDA. In this work we provide a preliminary analysis of the corpus, including the most debated news in 2015. Italiano. In questo la...
Building and exploiting a French corpus for sentiment analysis (Construction et exploitation d'un corpus français pour l'analyse de sentiment) [in French]
This work introduces a French corpus for sentiment analysis. We describe the construction and organization of the corpus. We then apply machine learning techniques to automatically predict whether a text is positive or negative (the opinion classification task). Two techniques are used: logistic regression and classification based ...
A Twitter Corpus and Benchmark Resources for German Sentiment Analysis
In this paper we present SB10k, a new corpus for sentiment analysis with approx. 10,000 German tweets. We use this new corpus and two existing corpora to provide state-of-the-art benchmarks for sentiment analysis in German: we implemented a CNN (based on the winning system of SemEval-2016) and a feature-based SVM and compare their performance on all three corpora. For the CNN, we also created G...
The ICWSM 2010 JDPA Sentiment Corpus for the Automotive Domain
This paper presents a rich annotation scheme for mentions, co-reference, meronymy, sentiment expressions, modifiers of sentiment expressions including neutralizers, negators, and intensifiers, and describes a large corpus annotated with this scheme. We describe how this corpus relates to recent, state-of-the-art work in sentiment analysis, and define the various annotation types, provide exampl...
Sentiment analysis methods in Persian text: A survey
With the explosive growth of social media such as Twitter, reviews on e-commerce websites, and comments on news websites, individuals and organizations are increasingly using opinions in these media for their decision making. Sentiment analysis is one of the techniques used to analyze users' opinions in recent years. Persian language has specific features and thereby requires unique meth...
A Multi-View Sentiment Corpus
Sentiment Analysis is a broad task that involves the analysis of various aspects of natural language text. However, most state-of-the-art approaches investigate each aspect independently, i.e. Subjectivity Classification, Sentiment Polarity Classification, Emotion Recognition, Irony Detection. In this paper we present a Multi-View Sentiment Corpus (MVSC), which comprise...
Publication date: 2007